KOTONOHA and BCCWJ: Development of a Balanced Corpus of Contemporary Written Japanese
Abstract
The National Institute for Japanese Language (NIJL) has launched a long-term language corpus development initiative aimed at developing a super-corpus called KOTONOHA, which consists of a multitude of independent corpora. Among the constituent corpora of KOTONOHA, the one most urgently needed is a large-scale balanced corpus of present-day written Japanese. Construction of such a corpus is currently underway under the auspices of NIJL, with financial support from a grant-in-aid for scientific research from MEXT. This paper describes the basic design issues of this balanced corpus, called BCCWJ. The BCCWJ consists of three component sub-corpora that differ in the nature of their statistical populations, i.e., the ‘production’, ‘circulation’, and ‘non-population’ sub-corpora. The first two sub-corpora represent the production and reception aspects of published written Japanese, while the last sub-corpus is an aggregate of various mini-corpora developed for specialized research and language planning purposes.
Similar resources
Constructing a Japanese Basic Named Entity Corpus of Various Genres
This paper introduces a Japanese Named Entity (NE) corpus of various genres. We annotated 136 documents in the Balanced Corpus of Contemporary Written Japanese (BCCWJ) with the eight types of NE tags defined by Information Retrieval and Extraction Exercise. The NE corpus consists of six types of genres of documents such as blogs, magazines, white papers, and so on, and the corpus contains 2,464...
BCCWJ-DepPara: A Syntactic Annotation Treebank on the 'Balanced Corpus of Contemporary Written Japanese'
Paratactic syntactic structures are difficult to represent in syntactic dependency tree structures. As such, we propose an annotation schema for syntactic dependency annotation of Japanese, in which coordinate structures are separated from and overlaid on bunsetsu(base phrase unit)-based dependency. The schema represents nested coordinate structures, non-constituent conjuncts, and forward shari...
Semantic Annotations in Japanese FrameNet: Comparing Frames in Japanese and English
Since 2008, the Japanese FrameNet (JFN, http://jfn.st.hc.keio.ac.jp/) project has been annotating the Balanced Corpus of Contemporary Written Japanese (BCCWJ), the first such corpus, officially released in October 2011. This paper reports annotation results of the book genre of BCCWJ (Ohara 2011, Ohara, Saito, Fujii & Sato 2011). Comparing the semantic frames needed to annotate BCCWJ with those...
An Approach toward Register Classification of Book Samples in the Balanced Corpus of Contemporary Written Japanese
Japanese books are usually classified into ten genres by Nippon Decimal Classification (NDC) based on their subject. However, this classification is sometimes insufficient for corpus studies which describe characteristics of the texts in the book. Here, we propose a method of classifying text samples taken from Japanese books into some registers and text types. Firstly, we discuss useful criter...
Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese
Compilation of a 100 million words balanced corpus called the Balanced Corpus of Contemporary Written Japanese (or BCCWJ) is underway at the National Institute for Japanese Language and Linguistics. The corpus covers a wide range of text genres including books, magazines, newspapers, governmental white papers, textbooks, minutes of the National Diet, internet text (bulletin board and blogs) and...